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Abstract 

We show how to accelerate the direct solution of the Boltzmann equation using 
^_i' Graphics Processing Units (GPUs). In order to fully exploit the computational 



power of the GPU, we choose a method of solution which combines a finite dif- 
ference discretization of the free-streaming term with a Monte Carlo evaluation of 
Q ■ the collision integral. The efficiency of the code is demonstrated by solving the two- 

O , dimensional driven cavity flow. Computational results show that it is possible to cut 

c/3 I down the computing time of the sequential code of two order of magnitudes. This 

• ,— I ' makes the proposed method of solution a viable alternative to particle simulations 

^.^1 for studying unsteady low Mach number flows. 
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Q ' 1 Introduction 



Non-equilibrium gas flows are met in several different physical situations rang- 



^H ' ing from the re-entry of spacecraft in upper planetary atmospheres to fluid- 



structure interaction in small-scale devices [1,2]. The correct description of 
nonequilibrium effects requires replacing the traditional hydrodynamic equa- 
tions with the Boltzmann equation which, in the absence of assigned external 
force flelds, reads 

^ + vVrf = C{fJ) (1) 
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In Eq. (1), the distribution function f{r,v\t) is the atomic number density 
at the single atom phase space point (r, v) at time t. The symbols r and v 
denote atom position and velocity, respectively. The left hand side of Eq. (1) 
represents the rate of change of / due to the indipendent motion of gas atoms. 
Effects of collisions are accounted for by the source term C(/, /) which is a non- 
linear functional of / whose precise structure depends on the assumed atomic 
interaction forces. Obtaining numerical solutions of Eq. (1) for realistic flow 
conditions is a challenging task because it has the form of a non-linear integro- 
differential equation in which the unknown function, /, depends on seven 
variables. Numerical methods used to solve Eq. (1) can be roughly divided 
into three groups: 

(a) Particle methods 

(b) Semi-regular methods 

(c) Regular methods 

Methods in group (a) originate from the Direct Simulation Monte Carlo (DSMC) 
scheme proposed by G.A. Bird [2]. They are by far the most popular and 
widely used simulation methods in rarefied gas dynamics. The distribution 
function is represented by a number of mathematical particles which move in 
the computational domain and collide according to stochastic rules derived 
from Boltzmann equation. Macroscopic flow properties are usually obtained 
by time averaging particle properties. If the averaging time is long enough, 
then accurate flow simulations can be obtained by a relatively small number 
of particles. The method can be easily extended to deal with mixtures of chem- 
ically reacting polyatomic species [2] and to dense fluids [3]. Although DSMC 
(in its traditional implementation) is to be recommended in simulating most 
of rarefied gas flows, it is not well suited to the simulation of low Mach num- 
ber or unsteady flows. Attempts have been made to extend DSMC in order 
to improve its capability to capture the small deviations from the equilibrium 
condition met in low Mach number flows [4,5]. However, in simulating high 
frequency unsteady flows, typical of microfluidics application to MEMS, the 
possibility of time averaging is lost or reduced. Acceptable accuracy can then 
be achieved by increasing the number of simulation particles or superposing 
several flow snapshots obtained from statistically independent simulations of 
the same flow; in both cases the computing effort is considerably increased. 
Methods in groups (b) and (c) adopt similar strategies in discretizing the dis- 
tribution function on a regular grid in the phase space and in using finite 
difference schemes to approximate the streaming term. However, they differ 
in the way the collision integral is evaluated. In semi-regular methods C{f, f) 
is computed by Monte Carlo or quasi Monte Carlo quadrature methods [6,7] 
whereas deterministic integration schemes are used in regular methods [8]. 
Whatever method is chosen to compute the collision term, the adoption of a 
grid in the phase space considerably limits the applicability of methods (b) and 
(c) to problems where particular symmetries reduce the number of spatial and 



velocity variables. As a matter of fact, a spatially three-dimensional problem 
would require a memory demanding six-dimensional phase space grid. Exten- 
sions to polyatomic gases are possible [9] but the necessity to store additional 
variables associated with internal degrees of freedom further limits the appli- 
cations to multi-dimensional flows. Therefore, until now the direct solution 
of the Boltzmann equation by semi-regular or regular methods has not been 
considered a viable alternative to DSMC for simulating realistic flows, not 
even for low speed and/or unsteady flows. The availability of low cost Graph- 
ics Processing Units (GPUs) has changed the situation. Although CPUs have 
been originally developed for graphics applications, they have been increas- 
ingly used to do general purpose scientific and engineering computing [10,11]. 
Mapping efficiently an algorithm on the SIMD-like architecture of the GPUs, 
however, is a difficult task which often requires the algorithm to be revised 
or even redesigned to both balance the hardware structure benefits and meet 
the implementation requirements. For instance, preliminary tests, performed 
within the framework of the research work described here, have shown that 
the standard form of DSMC is not efficiently ported on GPU's because of their 
SIMD-like architecture. On the other hand, we have shown in Ref. [12] that 
a regular method of solution of the BGKW kinetic model equation is ideally 
suited for GPUs. The main aim of the present paper is to translate efficiently 
a semi-regular method of solution of the full non-linear Boltzmann equation 
into a parallel code to be executed on a GPU. The efficiency of the algorithm 
is assessed by solving the classical two-dimensional driven cavity flow. It is 
shown that it is possible to cut down the computing time of the sequential 
code of two order of magnitudes. This paper is organized as follows. Sections 
2 and 3 are devoted to a concise description of the mathematical model and 
the adopted numerical method. In Section 4 the key aspects of the GPU hard- 
ware architecture and CUDA^^ programming model are briefly described and 
implementation details are provided. Sections 5 is devoted to the description 
of the test problem and the discussion of the results. Concluding remarks are 
presented in Section 6. 



2 Mathematical Formulation 



The hard-sphere model is a good approximation for simple fluids, that is flu- 
ids whose properties are largely determined by harshly repulsive short range 
forces. The hard-sphere Boltzmann collision integral reads 



C(/, /) = Y / iffi - fh) \k ■ Vr\dv,d'k (2) 

In Eq. (2), a is the hard sphere diameter, Vr = v—Vi is the relative velocity be- 
tween two coUiding atoms and /* = /(r, v*\t), fl = f{r, vl\t), /i = /(r, Vi\t). 



Here and in the remainder of the paper, integration extends over the whole ve- 
locity space. Similarly, the solid angle integration is over the surface of the unit 
sphere, whose points are associated with the unit vector k. The pre-collisional 
velocities, {v*,vl), are obtained from the post-collision velocities, {v,Vi), and 
the unit vector on the sphere, k, by the relationships 



v* = v+[vr-kjk (3) 

vl = vi- (vr ■ k) k (4) 



In view of the applications to the study of low Mach flows, Refs. [4,13] will 
be followed to rewrite Eqs. (1) and (2) in terms of the deviational part of the 
distribution function, h{r,v\t), defined as 

/ = $o(l + e/i) (5) 

where e is a parameter that measures the deviation from equilibrium conditions 
and ^Q{r,v) is the Maxwellian at equilibrium with uniform and constant 
density Uq and temperature Tq, i.e.. 



The physical rationale behind this formulation is a proper rescaling of the 
(small) deviation from equilibrium to reduce the variance in the Monte Carlo 
evaluation of the collision integral and thus to capture arbitrarily small devia- 
tions from equilibrium with a computational cost which is independent of the 
magnitude of the deviation. By substituting Eq. (5) into Eq. (1), we obtain 

dh 

-^ + v-Vrh=Q{h,h) (7) 

where, by using the property $o "^oi — *^o *^oi) the collision integral takes the 
form 
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Q{h, /i) = y /" $0 $01 [h* + hl-h-hi + e {h*hl - hhi)] \k ■ Vr\dvid^k (8) 

For later reference, we here report the expressions of dimensionless perturbed 
density, velocity, temperature and stress tensor 
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where po = uoRTq. At the boundaries, Maxwell's completely diffuse boundary 
condition is assumed. Accordingly, the distribution function of atoms emerging 
from walls is given by the following expression 



$o + e$o/i = n™$^ {v-Vu,)-fi>Q (13) 

In Eq. (13), n is the inward normal and $^ is the normalized wall Maxwellian 
distribution function 



''-^"'"^= (2vri?rj3/2 ^"P 



{v - V. 



(14) 
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where Vw the wall velocity and T^ the wall temperature. The wall density n^ 
is determined by imposing zero net mass flux at any boundary point 

nw j \c- fi\(^yjdv = f \c ■ h\^()dv + e \c-n\^ohdv (15) 

Jch>0 Jcfi<Q Jcn<0 

where c = v — Vw It is worth noticing that when the perturbation is suffi- 
ciently small, i.e., e — )■ 0, Eq. (7) reduces to the linearized Boltzmann equation 
and Eqs. (9)-(12) to the linearized expression of the macroscopic quantities. 
The formulation in terms of the deviational part of the distribution function, 
however, is not restricted to a vanishing perturbation but it is valid in the 
non-linear case as well. 



3 Outline of the numerical method 



The method of solution adopted to solve Eq. (7) is a semi-regular method in 
which a finite difference discretization is used to evaluate the free streaming 
term on the left hand side while the collision integral on the right hand side is 
computed by a Monte Carlo technique. The three-dimensional physical space 
is divided into Nr = N^ x Ny x N^ parallelepipedal cells. Likewise, the three- 
dimensional velocity space is replaced by a parallelepipedal box divided into 
Ny = Ny^ X Nyy X Ny^ cells. The size and position of the "velocity box" in the 
velocity space have to be properly chosen, in order to contain the significant 



part of h at any spatial position. The distribution function is assumed to be 
constant within each cell of the phase space. Hence, h is represented by the ar- 
ray hij{t) = h{x{Q,y{iy),z{i^),v^{j^),Vy{jy),V;,{j^)\t); x{Q,y{iy), z{i^) and 
Vxijx) , Vy{jy) , Vz{jz) are the values of the spatial coordinates and velocity com- 
ponents in the center of the phase space cell corresponding to the indexes 
i = {ix,iy,iz) and j = (jxjyjz)- 

The algorithm that advances h^j = hij{tn) to h^j^ = hij{tn + At) is con- 
structed by time-splitting the evolution operator into a free streaming step, in 
which the right hand side of Eq. (7) is neglected, and a purely collisional step, 
in which spatial motion is frozen and only the effect of the collision operator is 
taken into account. More precisely, the distribution function /i^ ■ is advanced 
to /i"t^ by computing an intermediate value, Kl'j^, from the free streaming 
equation 

— + v-Vrh = (16) 

at 

When solving Eq. (16), boundary conditions have to be taken into account. 

Eq. (16) is discretized by a simple first order explicit upwind conservative 

scheme. For later reference, we here report the difference scheme in the two 

dimensional case with Vx > and Vy > 



^ir!i;i = (1 - Cu. - Cu,)/.I1,,^^, + Cu. hl_,^,^^,^ + Cu, /.Il,,^_,^, (17) 

In Eq. (17), Cu^; = Vx{jx)^t/Ax and Cuj^ = Vy{jy)At/Ay are the Courant 
numbers in the x and y directions, respectively. 

After completing the free streaming step, h^~j^ is obtained by solving the 
homogeneous relaxation equation 

^ = Q(h,h) (18) 

where Q{h,h) is given by Eq. (8). In order to be solved, Eq. (18) is first 
integrated over the cell of the velocity space Cj 



dNij 
dt 



[ Q{h,h)dv (19) 



where Nij represents the deviation of the number of particles with position 
rj in the velocity cell centered around the velocity node j with respect to 
its mean value at equilibrium, i.e., Nij ^ AVj $oj hij with AVj the volume 
of the velocity cell Cj. The integral in Eq. (19) is then transformed into an 
integral extended to the whole velocity domain V 



dNij 



dt Jv 



fxjQ{h,h)dv (20) 

J (/ 



where Xj is the characteristic function of the cell Cj 



Xj{v) 



1 veCj 

v^Cj 



(21) 



Making use of some fundamental properties of the collision integral [1] , Eq. (20) 
can be written in the following form 



—^ = — / dvdvi $o(t^) $o(*^i, 



dk^ 



27r 
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[Xj{v*) + XjK) - Xj{v) - Xj{vi)] IKv) + h{v^) + €h{v)h{v^)] \k ■ Vr\ (22) 

The eight-fold integral in Eq. (22) is calculated by a Monte Carlo integration 
method, since a regular quadrature formula would be too demanding in term 
of computing time. The advantage of writing the rate of change of Nij in the 
above form is that the gaussian distribution function $o may be considered 
a probability density function from which the velocity points are drawn to 
estimate the collision integral with lower variance. The Monte Carlo estimate 
of the integral on the right hand side of Eq. (22) gives 



dN- ■ n^d^TT ^* 



dt 



N^ 



i 1=1 



Xj{vi) - Xj{vii)] 
[h{vi) + h{vu) + e h{vi) h{vii)] \k ■ v.,] (23) 



where Nt is the number of velocity samples [6]. It is worth noticing that the 
same set of collisions can be used to evaluate the collision integral at different 
space locations. Once the collision integral have been evaluated, the solution is 
advanced from the n-th time level to the next according to the explicit scheme 
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(24) 



In Eq. (24), dNij/dt is given by Eq. (23) with h the deviational part of the 
distribution function at the end of the streaming step, that is h^t^- Although 
memory demanding, the method outlined above produces accurate approxi- 
mations of the solution which do not require time averaging to provide smooth 
macroscopic fields. A drawback of the technique is that, due to the discretiza- 
tion in the velocity space, momentum and energy are not exactly conserved. 
The numerical error is usually small but tends to accumulate during the time 
evolution of the distribution function. The correction procedure proposed in 
Ref. [14] has been adopted to overcome this difficulty. At each time step the 
full distribution function is corrected in the following way 
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1 + A + B-v + Cv' 



(25) 



where the constants A, B and C are determined from the conditions 

/ 4j{v) $o(v) h"'+\v) dv= f 4j{v) <^o{v) h''+\v) dv (26) 

being ip{v) = l,v, v^ . The correction procedure given by Eq. (25) involves the 
full distribution function and not only its deviational part in order the linear 
system (26) to be well conditioned. 

4 GPU and CUDA^"" 

J^.l Overview of GPU computing 

NVIDIA®GPU is built around a fully programmable processors array orga- 
nized into a number of multiprocessors with a SIMD-like architecture, i.e. 
at any given clock cycle, each core of the multiprocessor executes the same 
instruction but operates on different data. CUDA^^ is the high level program- 
ming language specifically created for developing applications on this platform 
[15]. 

A CUDA^'''' program is organized into a serial program which runs on the 
host CPU and one or more kernels which define the computation to be per- 
formed in parallel by a massive number of threads. Threads are organized into 
a three-level hierarchy. At the highest level, all threads form a grid; they all 
execute the same kernel function. Each grid consists of many different blocks 
which contain the same number of threads. A single multiprocessor can man- 
age a number of blocks concurrently up to the resource limits. Blocks are 
independent, meaning that a kernel must execute correctly no matter the or- 
der in which blocks are run. A multiprocessor executes a group of threads 
beloging to the active block, called warp. All threads of a warp execute the 
same instruction but operate on different data. If a kernel contains a branch 
and threads of the same warp follow different paths, then the different paths 
are executed sequentially (warp divergence) and the total run time is the sum 
of all the branches. Divergence and re-convergence are managed in hardware 
but may have a serious impact on performances. When the instruction has 
been executed, the multiprocessor moves to another warp. In this manner the 
execution of threads is interleaved rather than simultaneous. 
Each multiprocessor has a number of registers which are dynamically parti- 
tioned among the threads running on it. Registers are memory spaces that are 
readable and writable only by the thread to which they are assigned. Threads 
of a single block are allowed to synchronize with each other and are available 
to share data through a high-speed shared memory. Threads from different 
blocks in the same grid may coordinate only via operations in a slower global 
memory space which is readable and writable by all threads in a kernel as 



well as by the host. Shared memory can be accessed by threads within a block 
as quickly as accessing registers. On the contrary, I/O operations involving 
global memory are particularly expensive, unless access is coalesced [15]. Be- 
cause of the interleaved warp execution, memory access latency is partially 
hidden, i.e., threads which have read their data can be performing computa- 
tions while other warps running on the same multiprocessor are waiting for 
their data to come in from global memory. Note, however, that GPU global 
memory is still ten time faster than the main memory of recent CPUs. 
Code optimization is a delicate task. In general, applications which require 
many arithmetic operations between memory read/write, and which minimize 
the number of out-of-order memory access, tend to perform better. Number 
of blocks and number of threads per block have to be chosen carefully. There 
should be at least as many blocks as multiprocessors in the device. Running 
only one block per multiprocessor can force the multiprocessor to idle during 
thread synchronization and device memory reads. By increasing the number 
of blocks, on the other hand, the amount of available shared memory for each 
block diminishes. Allocating more threads per block is better for efficient time 
slicing, but the more threads per block, the fewer registers are available per 
thread. 



4-2 CUDA^'^ implementation 



The code to numerically solve Eq. (7) is organized into a host program, which 
deals with all memory management and other setup tasks, and three kernels 
running on the GPU which perform the streaming and the collision steps. In 
the following, we report and discuss the pseudo-codes of each kernel. Because 
of their different impact on the code performance, we distinguish the slow 
global memory reads, <^, and writes, =^, from the fast reads, ^, and writes, 
— )■, from local registers and shared memory. 

Algorithm (1) reports the pseudocode of the two dimensional streaming kernel. 
The one-dimensional case has been discussed in Ref. [12] whereas the exten- 
sion to three-dimensional geometries is straightforward. Moreover, for clarity 
of presentation, the pseudo-code of the streaming kernel refers to one cell of 
the velocity space with Vx > and Vy > 0. The other cases can be handled 
analogously. As shown by Eq. (17), for each cell of the velocity domain, the 
streaming step involves the distribution function evaluated at different space 
locations. Similarly to the one dimensional case, the key performance enhanc- 
ing strategy is to allow threads to cooperate in the shared memory. In order 
to fit into the device's resources, blocks are composed by a two dimensional 
grid of threads with dimension B^ x By having each thread associated with 
one cell of the physical space. When a block become active, each thread loads 
one element of the distribution function from global memory, stores it into 



shared memory (line 5), updates its value according to Eqs. (17) (line 21) and 
then saves it back to the global memory (line 22). This procedure is repeated 
sequentially {N^^/ B^ — 1) x (Ny/By — 1) times. To ensure non-overlapping ac- 
cess, threads are synchronized at the onset of writing to the global memory 
(lines 20). In order to obtain a coalesced access to the global memory, values 
of the discretized distribution function of cells which are adjacent in the phys- 
ical space are stored in contiguous memory locations. However, not all the 
threads in a block can read data in a coalescent manner. In fact, in order to 
update h^j the values of the distribution function of two upwind neighboring 
nodes, often referred to as "halo" nodes [16], are required. The halos on one 
physical direction can be read in with coalesced access (line 13-19) while the 
others have to be read in with non-coalesced access (line 6-12). Threads which 
update boundary points perform calculations which are slightly different to 
account for the incoming Maxwellian flux from the boundaries of the domain 
(lines 8 and 15). 

The relaxation step is organized into two kernels whose pseudo-codes are listed 
in Algorithms (2) and (3). The first kernel computes the sequence of Nt colli- 
sions used in the Monte Carlo evaluation of the collision integral. The second 
kernel updates the discretized distribution function, executes the correction 
procedure and computes the macroscopic quantities of interest as well. 
Algorithm (2) reports the pseudo-code of the sampling kernel. Here, there are 
as many threads as the number of the collision samples, Nt. Firstly, each thread 
generates the pre-collisional velocities v and Vi by sampling the maxwellian 
distribution function with the Box-Muller algorithm (line 1-2) and the unit 
vector k by sampling the uniform distribution on the unit sphere (line 3). 
Afterwards, the post-coUisional velocities are evaluated (line 4-6) and the in- 
dex of the velocity cells to which they belong are calculated and stored in the 
vectors /, /i, /* and /J" defined in the global memory (lines 7-10). Finally, for 
each velocity cell, the values to be added and subtracted to these velocity cells 
are calculated (lines 11-14) and stored (lines 15-18) in the vectors C, Ci, C* 
and C^ defined in the global memory. It is important to note that in order to 
maximize the performance all the accesses to the global memory are done in 
a coalesced manner [15]. 

To update the discretized distribution function in a cell of the physical space 
according to Eq. (24), no information from nearby space cells is required. This 
naturally fits for the GPU, where one may define as many threads as the 
number of cells in the physical space. Moreover, by having one thread for each 
cell of the physical space, potentially dangerous conflicts between threads are 
avoided and the accesses to the global memory may be coalesced. Firstly, each 
thread updates the discretized distribution function according to Eq. (23), 
then executes the correction procedure to enforce the conservation of momen- 
tum and energy, Eq. (25), and finally compute the macroscopic quantities of 
interest. Algorithm (3) shows the pseudo-code of the relaxation kernel. The 
main part of the algorithm is in between the lines 7 and 21 where the collision 
integral is evaluated according to the Monte Carlo method, Eq. (23). Lines 
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from 1 to 6 and from 22 to 30 evaluate the required moments of the distribu- 
tion function before and after the colhsion step, respectively. These moments 
are used in Eq. (26) to obtain the constants A, B and C. The last loop over 
the velocity space (lines 32-39) corrects the distribution function according to 
Eq. (25) and compute the macroscopic quantities of interest. 
The computations that are shown below, have been performed on a commer- 
cially available GPU GeForce GTX 260 produced by NVIDIA® using CUDA^"* 
version 2.0. The GTX 260 GPU model consists of 24 streaming multiproces- 
sors with 8 streaming processors (SP) each for a total of 192 units. Each SP is 
clocked at 1.242 GHz and performs up to 3 floating point operation (FLOP) 
per clock cycle, yielding a peak theoretical performance of 715.4 GFLOPs 
(192 X 1.242 X 3). Each group of SP shares one 16 kB of fast per-block shared 
memory while the GPU has 896 MB of device memory with a memory band- 
width of 111.9 GB/s. The graphic processing unit has been hosted by a per- 
sonal computer equipped with 4 GB of main memory and an Intel® Core Duo 
Quad Q9300 CPU, running at 2.5 GHz. The host machine has also been used 
to run the sequential version of the program to obtain the speed-up data. The 
host code has been compiled using the gcc/g-|--|- compiler with optimization 
option "-03". 



5 Test case: driven cavity flow 

5.1 Formulation of the problem 



The driven cavity flow is a classical benchmark problem. In spite of its sim- 
ple geometry, in fact, it contains most of the features of more complicated 
problems described by kinetic equations [17]. In the following, we restrict to 
the spatially two-dimensional case. We thus consider a monatomic rarefied gas 
contained in a square enclosure with length L, i.e., r G [—L/2,L/2] x [0,L]. 
All the walls are kept at uniform and constant temperature Tq. Initially, the 
gas is supposed to be in uniform equilibrium with density no and temperature 
Tq. The flow is driven by the translation of the lid of the cavity with constant 
velocity Vy^. We describe the dynamics of the gas by Eq. (7) and assume that 
atoms which strike the walls are re-emitted according to the Maxwell's scat- 
tering kernel with complete accomodation, Eq. (13). 

As characteristic length we choose Aq = /io/po(2-RTo)^/^, with /io the viscosity 
of the hard sphere gas [18]. Likewise the characteristic time is given by Ao/Vq, 
where Vq = {2RTqY^'^. The cavity flow problem has been solved for three val- 
ues of the rarefaction parameter 6 = L/Xq = 0.1, 1, 10, being 6 proportional 
to the reciprocal of the Knudsen number. Since the proposed method of so- 
lution is particularly effective in capturing small deviations from equilibrium, 
we set the lid velocity to Vw/Vq = 0.01. The computations described in below. 
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hence, refer to very low Mach number driven cavity flows. The square cavity 
has been divided into N^ = N^ x Ny uniform square cells, with A^^, = A'j^. 
Likewise the velocity space has been divided into A'^ = N^^ x N^ x Ny^ 
with A''^^ = N^y = Ny^. Since the deviation form equilibrium is supposed 
to be small, a cubic velocity space has been constructed by distributing the 
velocity nodes along each velocity component in the interval [— 3Vo,3Vo]. In 
order to achieve a faster convergence of the solutions in the velocity space, the 
lengths of the cells are uniformly stretched with a progression ratio Vy, being 
the smaller cells located at the origin of the velocity space. More precisely, it 
has been chosen Ny = 8000 and r^ = 0.945 for 6 = 0.1 and 6=1 whereas 
A'^ = 5832 and r„ = 0.970 for 6 = 10 The number of collisions used in the 
Monte Carlo evaluation of the collision integral have been varied with the 
rarefaction parameter. In particular, it has been set N^ = 1024 for 6 = 0.1, 
Nt = 6144 for 5 = 1 and A'^t = 8192 for 5 = 10. Finally, the time step has been 
determined by setting the maximum Courant number to 0.5. 



5.2 Results and discussion 



In this section, we first carry out a convergence analysis of the method in the 
physical space and then we investigate the parallel performances of the code. 
In order to establish the convergence rate we compute two global fiowfield 
properties, namely the mean dimensionless shear stress on the moving wall, 
D, and the dimensionless flow rate of the main vortex, G. The two quantities 
are defined as 

D = l f'u^y{x,6\t)dx, G = l f\u^{5/2,y\t)\dy (27) 

JO Jq 

The absolute relative error in the long-term mean values of D and G are shown 
in Figs la and lb, respectively, versus the spatial grid size, h/5 = l/\/Nr, and 
for 5 = 0.1 (circles), 6 = 1 (squares) and S = 10 (triangles). The exact values 
of D and G, which are referred to as Dg and Ge, have been extrapolated from 
the linear fit of the results when h ^ 0. The linear behaviour of the absolute 
relative errors demonstrates that the results are in the asymptotic range of 
convergence and the method is first order accurate [19]. The finest physical 
grid size provides predictions which are accurate only within few percent. More 
precisely, the largest error in D is of the order of 4% and is attained at 5 = 10 
whereas the one in G is 2% at (5 = 0.1. The error is mainly due to the physical 
and velocity discretizations. As a matter of fact the statistical error associated 
with the finite sample size used in the Monte Carlo evaluation of the collision 
integral does not affect the results significantly. The standard deviation of D 
and G with respect to their averaged values in stationary conditions, in fact, 
is negligible small. For instance, the standard deviation oi D aX 6 = 1 from its 
long-term mean value is less than 0.05%. The grid resolution in the physical 



12 



and the velocity domains are the more accurate discretization compatible with 
the GPU global memory constraint, i.e., iV^ = 25600, N„ = 8000 for 6 = 0.1 
and 6 = 1 whereas A^^ = 36864, A^^ = 5832 for 6 = 10. Therefore, in order to 
improve the accuracy of the numerical solutions, we have adopted a nonuni- 
form grid in the physical space. The lengths of the physical cells are uniformly 
stretched with progression ratios that depend on the rarefaction parameter, 
r^ = 0.990, ry = 0.995 for 5 = 10 and r^ = 0.996, ry = 0.998 otherwise. The 
smaller cells are located close to the upper corners of the cavity where severe 
gradients are anticipated. All the results which follow have been obtained with 
these discretizations. Table 1 compares the predictions of the long-term mean 
values of D and G with the extrapolated exact values, D^ and Ge, and the 
results reported in Ref. [17] where the linearized BGKW equation has been 
solved with a discrete velocity method. The accuracy of the numerical solution 
of the non-linear Boltzmann equation can be estimated to be within 2%. The 
agreement with the BGKW results is good, which is not unexpected. Since 
the velocity of the lid is small, in fact, the gas is in a weakly nonequilibrium 
state and the solution of the non-linear Boltzmann equation approaches the 
solution of the linearized BGKW equation. Figures 2a and 2b show the pro- 
files of the dimensionless horizontal component of the velocity, Kr/Ko? along 
the vertical line crossing the center of the cavity and the dimensionless ver- 
tical component of the velocity, Vy/Vyj, along the horizontal line crossing the 
center of the main vortex which forms in the cavity, respectively. Lines are 
the solutions of the non-linear Boltzmann equation, whereas symbols are the 
results reported in Ref. [17]. The results refer to two different values of the 
rarefaction parameters, 6 = 0.1 (dashed lines and squares) and 6 = 10 (solid 
lines and circles). The agreement is very good. 

Although the convergence analysis has been performed by referring to quan- 
tities evaluated in stationary conditions, it is important to point out that the 
proposed method of solution provides accurate results in the unsteady regime 
as well. As an example. Figure 3 shows the evolution of D during the simu- 
lation for 6 = 10. Such a result would be difficult to obtain with a particle 
method where computationally expensive ensemble averanging are needed to 
provide smooth macroscopic fields. 

The performance of the GPU implementation is compared against the single- 
threaded version running on the CPU by computing the speed-up factor 
S = Tcpu/Tgpu, where Tcpu and Tgpu are the times used by the CPU and 
GPU, respectively. Times are measured after initial setup, and do not include 
the time required to transfer data between the disjoint CPU and GPU mem- 
ory spaces. 

We analyse separately the streaming and the collision step, the latter compris- 
ing both the sampling and the collision kernel. Figure 4 reports the obtained 
speed-up data as a function of the number of spatial grid points Nr ioi 6 = 1. 
The speed-up grows rapidly with A^^ and than levels up at about 450 if A'',. 
approximately exceeds 10^. This behavior is the result of the parallel set up of 
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the collision step in A''^ independent threads one for each cell of the physical 
space. As discussed below, the collision step absorbs most of the computa- 
tional resources and its execution strongly affects the overall performances. 
As shown by the speed-up curve, the GPU power is not fully exploited till 
the number of concurrent threads reaches a threshold. Beyond, the speed-up 
saturates and the computing time approximately behaves as a linear function 
of Nr. This behaviour is similar to the one reported in Refs. [20,21]. Figure 5 
shows the relative time which is spent on the streaming step (dark bar) and 
on the collision step (light bar) as well as the total execution time in seconds 
(numbers over the bars) for 5 = 1. As expected, the collision step is more 
time consuming than the streaming step which takes at most 36% of overall 
computing time. 

A strongly simplified evaluation of ideal performances of the streaming and 
collision step can be carried out as follows. A single application of the upwind 
scheme requires the execution of 11 floating point operations and 2.3 accesses 
to the global memory. The GPU delivers 715.4 GFLOPs but the transfer rate 
to/from the main memory is limited to 111.9 GB/s. Since in the case of the 
streaming step the ratio of number of floating point operations to the number 
of bytes accessed is low (11 : 9.2), it is reasonable to obtain the number of 
floating point operation per second from the transfer rate alone. Hence, the 
ideal number of GFLOPs can be obtained by assuming that 11 floating point 
operations will be executed in the time required to transfer 9.2 bytes from the 
main memory. Accordingly, this simple argument yields an ideal performance 
of the streaming step of 133 GFLOPs. A similar performance analysis can be 
applied to the collision step which encompasses both the sampling and col- 
lision kernel. In order to update the distribution function and compute the 
macroscopic quantities of interest, the number of FLOPs that the GPU ex- 
ecutes at any given time step and for each cell in the physical space is the 
sum of two contributions. The first is proportional to the number of velocity 
cells, Ny, and the second one is proportional to the number of collisions used 
to evaluate the collision integral, Nt. The resulting number of FLOPs for each 
time step are of the order of Nr{80N^ + 12Nt). Likewise the number of bytes 
accesses to the global memory per time step is of the order of Nr{8Ny + GiNt) 
Arguments similar to those above lead to estimate an ideal performace of the 
colhsion step of about 174.7 GFLOPs. 

Timing the execution of the separate kernels and counting the number of as- 
sociated floating point operations provides the real performance. The results 
are reported in Fig. 6 where GFLOPs are shown as a function of the number 
of grid points, A^,.. Solid line with circles, dashed line with squares and dot- 
dashed line with triangles are the measured performances of the streaming 
step, the collision step and the overall code, respectively. It is possible to note 
that the performance of the streaming step grows with Nr and quickly levels 
at about 30 GFLOPs, approximately one fourth of the estimated ideal per- 
formance. The difference can be justified by observing that the real CUDA^^ 
implementation of the finite difference scheme is not free from thread diver- 
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gence and ancillary tasks whose effects can be evaluated with difficulty [16]. 
The collision step performance closely patterns the speed-up behavior, that 
is it rapidly grows in the range N^ < 10^ and then levels up at about 140 
GFLOP/s, reasonably close to the theoretical estimate. The collision kernel 
performs better than the streaming kernel due to its higher FLOP to memory 
operation ratio which, in turn, allows a more efficient use of GPU computing 
power. The absence of thread divergence is also a feature which positively 
affects performances. 



6 Conclusions 



In this paper we have assessed the possibility of exploiting the computational 
power of modern GPUs to simulate nonequilibrium rarefied gas flows. The 
full nonlinear Boltzmann equation has been solved by means of a semi-regular 
method which combines a finite difference discretization of the free-streaming 
term with a Monte Carlo evaluation of the collision integral. This method 
of solution is ideally suited for the SIMD-like architecture provided by the 
commercially available GPUs. The two dimensional driven cavity flow has 
been used as a benchmark problem. The results lead to concluding that the 
porting of the sequential code onto GPUs allows a reduction of the computing 
time of two orders of magnitude, being the observed speed-up as high as 400. 
Although the test problem examined here has clearly shown that the size 
of physical memory is the main obstacle toward the application to complex 
two or three-dimensional flows, the numerical method described above can 
be reformulated as a less memory demanding particle scheme in many ways. 
Hence, the present work and results are a first step toward the construction 
of a more flexible and efficient method for the numerical solution of kinetic 
equations. 
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Captions to Figures: 



Fig. 1: Absolute relative error on (a) drag coefficient and (b) mean flow rate 
for 6 = 0.1 (circles), S = 1 (squares) and 6 = 10 (triangles) versus the size 
h/S of the physical grid. Lines are the least-mean square flt of the results. 
A^„ = 8000, r„ = 0.945, Nt = 1024 for S = 0.1; N^ = 8000, r„ = 0.945, 
Nt = 6144 for 5 = 1; N^ = 5832, r„ = 0.970, Nt = 8192 for 5 = 10. 

Fig. 2: Profiles of the dimensionless (a) horizontal mean velocity along the 
vertical line crossing the center of the cavity, and (b) vertical mean velocity 
component along the horizontal line crossing the center of the main vortex. 
Solid and dashed lines: numerical solutions obtained by solving Eq. (7) with 
the semi-regular method for 6 = 10 and S = 0.1, respectively. Circles and 
squares: numerical solutions reported in Ref. [17] for 6 = 10 and 6 = 0.1, 
respectively. N^ = 25600, r^ = 0.996, r^ = 0.998, iV„ = 8000, r^ = 0.945, 
Nt = 1024 for S = 0.1; A^^ = 36864, r^ = 0.990, Vy = 0.995, iV„ = 5832, 
r^ = 0.970, Nt = 8192 for 5 = 10. 

Fig. 3: Drag coefficient over the moving wall, D, versus dimensionless time, 
t/to- 5 = 1. 

Fig. 4: Overall speed-up, S, versus the number of cells in the physical space, 
Nr. 5 = 1, N, = 8000, Nt = 6144. 

Fig. 5: Relative time spent on the streaming step (dark bar) and on collision 
step (light bar). The numbers above the bars refer to the total execution time 
expressed in seconds. S = 1, Ny = 8000, Nt = 6144. 

Fig. 6: GFLOPs versus the number of cells in the physical space, Nr. Solid 
line with circles: streaming step; dashed line with squares: collision step; dot 
and dashed line with triangles: overall code. 6 = 1, N^ = 8000, Nt = 6144. 

Table 1: Drag coefficient, D, and mean fiow rate, G, versus the rarefaction 
parameter, 6. De and Ge represent the estimated exact values. 



Algorithm 1 GPU pseudo-code of the two-dimensional streaming kernel 

Require: Cu^;, Courant number along x direction 

Require: Cuy, Courant number along y direction 

Require: B^, number of threads in x direction 

Require: By, number of threads in y direction 

Require: t^, thread index in x direction within the block 

Require: ty, thread index in y direction within the block 

Require: /sh, matrix {B^ + 1) x [By + 1) in the shared memory 

1: for hy = Ny/By - 1 to do 

2: for h^ = N^/B^ - Ito do 

41 %y i Zy -\- IDyl}yy 

if ty == then 
ii iy — 1 < then 

hsh(tx,ty) ^ boundaryFlux 
else 

end if 
end if 
if tx == then 

if Za; — 1 < then 

hsh{tx,ty) ^— boundaryFlux 
else 

hshitx,ty) <= Kl^^^^i^j 
end if 
end if 
syncthreads 

/irg ^ (l-Cu2;-CUy)/lsh(^x + l,^j/ + l) + CUj//ish(ix,^3; + l) + Cu2;/lsh(^a 

hx ■^ hx — ^ 
end for 

end for 
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Algorithm 2 GPU pseudo-code of the samphng kernel 
Require: i, global thread index in the grid 



1 


V ^BoxMulUer 


2 


vi ^BoxMulller 


3 


k ^UnitSphere 


4 


Vr -i^ Vi — V 


5 


V* ^ V + {Vr ■ k)k 


6 


vl -^ Vi — [Vr ■ k)k 


7 


cells(t;) => I{i) 


8 


cells('i;i) =^ Ii{i) 


9 


cells(t;*) => r{i) 


10 


ce\\s{vl) => I*{i) 


11 


Qj ^ Ticr'^nKAVj^jy^lVr ■ k\At 


12 


g.^ ^ na^nl{AVj^^jX'\Vr ■ k\At 


13 


Qj* ^ 'Ka'^nl{AVj*^j'Y'^\Vr ■ k\At 


14 


9jl ^ 7ra^nl{AVji^Jir'\vr ■ k\At 


15 


9j => C{z) 


16 


93, ^C'i(z) 


17 


9r => C*{t) 


18 


9ji => cm 
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Algorithm 3 GPU pseudo-code of the colhsion kernel 
Require: i, global thread index in the grid 
1: for all j do 
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^ <= K3 



n+l 

p^ p + $oj h 
u ^ u + Vj $oj h 
e ^ e + l^jp $o,j /i 

end for 

for m = 1 to A'f do 

h <= hnim) 
hi <= hTiU 
h* <= h-ikm) 

g ^ h -\- hi + e h hi 
h ^ h — C{m) g 
hi ^ hi — Ci{m) g 
h* ^ h* + C*lm) g 
h*i ^h*i + Cllm) g 

h => htiU 
hi => /i"+^ 



'^ ^ 'y,I*{m) 

hi => /l"+l 



i,Ii(m) 

n+l 

i,I*{ni) 

n+l 

i,/*(m) 



end for 
for all j do 



h <= K,j 



71+1 

an ^ ail + ^o,j h 
ai2 ^ ai2 + Vj $oj h 
ai3 ^ ai3 + li^jp $oj/i 



// others moments of the distribution function 

end for 

[A, B, C]=linearSolver(n, it, e, an, ai2, 013, . . .) 
for all j do 



n+l 

h ^ l/e(l + eh){l + A + B ■ Vj + Cv'j - I) 



h ^ hl^ 
h ^ h^f 



II compute macroscopic quantities 
end for 
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6 


D 


De 


D (Ref. [17]) 


G 


Ge 


G (Ref. [17]) 


0.1 


0.6712 


0.6815 


0.676-0.678 


0.0955 


0.0977 


0.0973-0.0976 


1 


0.6266 


0.6389 


0.625-0.631 


0.1017 


0.1039 


0.104-0.105 


10 


0.4096 


0.4176 


0.412-0.415 


0.1427 


0.1451 


0.145-0.145 



Table 1 



28 



